Convolutional Neural Networks Applied to House Numbers Digit Classification
We classify digits of real-world house numbers using convolutional neural
networks (ConvNets). ConvNets are hierarchical feature learning neural networks
whose structure is biologically inspired. Unlike many popular vision approaches
that are hand-designed, ConvNets can automatically learn a unique set of
features optimized for a given task. We augment the traditional ConvNet
architecture by learning multi-stage features and by using Lp pooling, and
establish a new state of the art of 94.85% accuracy on the SVHN dataset (45.2%
error improvement). Furthermore, we analyze the benefits of different pooling
methods and multi-stage features in ConvNets. The source code and a tutorial
are available at eblearn.sf.net.
Comment: 4 pages, 6 figures, 2 tables
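To make the Lp pooling mentioned in this abstract concrete, here is a minimal sketch in plain Python. It assumes uniform weights inside each pooling window and non-overlapping windows; the abstract does not specify the weighting, so the function names `lp_pool` and `lp_pool2d` and these choices are illustrative assumptions, not the authors' implementation. With p = 1 the operation reduces to average pooling of magnitudes, and as p grows it approaches max pooling.

```python
def lp_pool(window, p):
    """Lp-pool a flat list of activations (uniform weights: an assumption)."""
    if p == float("inf"):
        # Limit case: the largest magnitude dominates, i.e. max pooling.
        return max(abs(v) for v in window)
    n = len(window)
    return (sum(abs(v) ** p for v in window) / n) ** (1.0 / p)


def lp_pool2d(image, size, p):
    """Apply non-overlapping size x size Lp pooling to a 2D list of numbers."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(lp_pool(window, p))
        pooled.append(pooled_row)
    return pooled
```

For example, `lp_pool2d([[1, 2], [3, 4]], 2, 1)` averages the four values, while passing `float("inf")` as `p` returns their maximum.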
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
We present an integrated framework for using Convolutional Networks for
classification, localization and detection. We show how a multiscale and
sliding window approach can be efficiently implemented within a ConvNet. We
also introduce a novel deep learning approach to localization by learning to
predict object boundaries. Bounding boxes are then accumulated rather than
suppressed in order to increase detection confidence. We show that different
tasks can be learned simultaneously using a single shared network. This
integrated framework is the winner of the localization task of the ImageNet
Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very
competitive results for the detection and classification tasks. In
post-competition work, we establish a new state of the art for the detection
task. Finally, we release a feature extractor from our best model called
OverFeat.
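The idea of accumulating bounding boxes rather than suppressing them can be sketched as a greedy merge. The sketch below is a simplified illustration, not the procedure from the paper: it assumes intersection over union (IoU) as the matching criterion, a hypothetical `min_iou` threshold, strictly positive confidences, and a confidence-weighted average when two boxes are merged. The point it shows is that overlapping predictions reinforce each other (their confidences add up) instead of all but one being discarded.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accumulate_boxes(detections, min_iou=0.5):
    """Greedily merge overlapping (box, confidence) pairs.

    The best-overlapping pair is replaced by one box whose coordinates are
    the confidence-weighted mean and whose confidence is the sum, until no
    pair overlaps by at least min_iou (a hypothetical threshold).
    """
    merged = [(tuple(box), conf) for box, conf in detections]
    while True:
        best = None
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                score = iou(merged[i][0], merged[j][0])
                if score >= min_iou and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            return merged
        _, i, j = best
        (box_a, conf_a), (box_b, conf_b) = merged[i], merged[j]
        total = conf_a + conf_b
        box = tuple((conf_a * u + conf_b * v) / total
                    for u, v in zip(box_a, box_b))
        merged = [m for k, m in enumerate(merged) if k not in (i, j)]
        merged.append((box, total))
```

Under non-maximum suppression, two agreeing detections with confidence 0.5 each would yield a single box at 0.5; here they yield a single box at 1.0.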
Time-Contrastive Networks: Self-Supervised Learning from Video
We propose a self-supervised approach for learning representations and
robotic behaviors entirely from unlabeled videos recorded from multiple
viewpoints, and study how this representation can be used in two robotic
imitation settings: imitating object interactions from videos of humans, and
imitating human poses. Imitation of human behavior requires a
viewpoint-invariant representation that captures the relationships between
end-effectors (hands or robot grippers) and the environment, object attributes,
and body pose. We train our representations using a metric learning loss, where
multiple simultaneous viewpoints of the same observation are attracted in the
embedding space, while being repelled from temporal neighbors which are often
visually similar but functionally different. In other words, the model
simultaneously learns to recognize what is common between different-looking
images, and what is different between similar-looking images. This signal
causes our model to discover attributes that do not change across viewpoint,
but do change across time, while ignoring nuisance variables such as
occlusions, motion blur, lighting and background. We demonstrate that this
representation can be used by a robot to directly mimic human poses without an
explicit correspondence, and that it can be used as a reward function within a
reinforcement learning algorithm. While representations are learned from an
unlabeled collection of task-related videos, robot behaviors such as pouring
are learned by watching a single third-person demonstration by a human. Reward
functions obtained by following the human demonstrations under the learned
representation enable efficient reinforcement learning that is practical for
real-world robotic systems. Video results, open-source code and dataset are
available at https://sermanet.github.io/imitat
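The attract/repel signal described in this abstract can be illustrated with a triplet loss, one common form of metric learning loss. The sketch below assumes that form, a hypothetical margin of 0.2, and embeddings stored as `embeddings[view][t]`; the names `make_triplet` and `triplet_loss` are illustrative and not taken from the released code. The anchor and positive are the same moment seen from two viewpoints, and the negative is a temporal neighbor from the anchor's own viewpoint.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def make_triplet(embeddings, t, anchor_view, positive_view, offset):
    """Pick (anchor, positive, negative) embeddings.

    anchor:   frame t from anchor_view
    positive: frame t from positive_view (same moment, other viewpoint)
    negative: frame t + offset from anchor_view (temporal neighbor)
    """
    anchor = embeddings[anchor_view][t]
    positive = embeddings[positive_view][t]
    negative = embeddings[anchor_view][t + offset]
    return anchor, positive, negative


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther than the positive by at least margin."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

Minimizing this loss pulls simultaneous views together and pushes nearby frames apart, which matches the stated goal of features that are invariant to viewpoint but sensitive to time.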
Going Deeper with Convolutions
We propose a deep convolutional neural network architecture codenamed
"Inception", which was responsible for setting the new state of the art for
classification and detection in the ImageNet Large-Scale Visual Recognition
Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the
improved utilization of the computing resources inside the network. This was
achieved by a carefully crafted design that allows for increasing the depth and
width of the network while keeping the computational budget constant. To
optimize quality, the architectural decisions were based on the Hebbian
principle and the intuition of multi-scale processing. One particular
incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a
22-layer deep network, the quality of which is assessed in the context of
classification and detection.
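One way to see how depth and width can grow under a fixed computational budget is to count multiply-accumulate operations. The arithmetic sketch below assumes a 1x1 convolution is used to reduce channels before an expensive 5x5 convolution, a technique described in the full paper but not in this abstract. The feature-map size and channel counts are illustrative values chosen for the example, and `conv_macs` is a hypothetical helper.

```python
def conv_macs(height, width, in_channels, out_channels, kernel):
    """Multiply-accumulates for a stride-1, same-padded convolution."""
    return height * width * in_channels * out_channels * kernel * kernel


# Illustrative sizes (assumptions): a 28 x 28 map with 192 input channels,
# producing 32 output channels through a 5x5 convolution.
H, W, C_IN, C_OUT, C_REDUCED = 28, 28, 192, 32, 16

# Direct 5x5 convolution on all 192 channels.
direct = conv_macs(H, W, C_IN, C_OUT, 5)

# 1x1 reduction to 16 channels, then the 5x5 convolution.
reduced = conv_macs(H, W, C_IN, C_REDUCED, 1) + conv_macs(H, W, C_REDUCED, C_OUT, 5)

print(direct, reduced, round(direct / reduced, 1))
```

With these numbers the reduced branch adds a layer of depth yet needs roughly a tenth of the operations, which leaves budget for additional parallel branches.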